Luc Devroye, Gabor Lugosi and Gergely Neu



Abstract 

We propose a version of the follow-the-perturbed-leader online prediction algorithm in which the
cumulative losses are perturbed by independent symmetric random walks. The forecaster is shown
to achieve an expected regret of the optimal order $O(\sqrt{n \log N})$, where $n$ is the time horizon and
$N$ is the number of experts. More importantly, it is shown that the forecaster changes its prediction
at most $O(\sqrt{n \log N})$ times, in expectation. We also extend the analysis to online combinatorial
optimization and show that even in this more general setting, the forecaster rarely switches between
experts while having a regret of near-optimal order.

Index Terms
I. Preliminaries

In this paper we study the problem of online prediction with expert advice, see [1]. The
problem may be described as a repeated game between a forecaster and an adversary, the
environment. At each time instant $t = 1, \dots, n$, the forecaster chooses one of the $N$ available
actions (often called experts) and suffers a loss $\ell_{i,t} \in [0, 1]$ corresponding to the chosen action
$i$. We consider the so-called oblivious adversary model in which the environment selects all
losses before the prediction game starts and reveals the losses $\ell_{i,t}$ at time $t$ after the forecaster
has made its prediction. The losses are deterministic but the forecaster may randomize: at
time $t$, the forecaster chooses a probability distribution $p_t$ over the set of $N$ actions and draws
a random action $I_t$ according to the distribution $p_t$. The prediction protocol is described in
Figure 1.

The usual goal for the standard prediction problem is to devise an algorithm such that
the cumulative loss $\hat{L}_n = \sum_{t=1}^n \ell_{I_t,t}$ is as small as possible, in expectation and/or with high
probability (where probability is with respect to the forecaster's randomization). Since we
do not make any assumption on how the environment generates the losses $\ell_{i,t}$, we cannot
Parameters: set of actions $\{1, 2, \dots, N\}$, number of rounds $n$;

The environment chooses the losses $\ell_{i,t} \in [0, 1]$ for all $i \in \{1, 2, \dots, N\}$ and $t = 1, \dots, n$.

For all $t = 1, 2, \dots, n$, repeat

1) The forecaster chooses a probability distribution $p_t$ over $\{1, 2, \dots, N\}$.

2) The forecaster draws an action $I_t$ randomly according to $p_t$.

3) The environment reveals $\ell_{i,t}$ for all $i \in \{1, 2, \dots, N\}$.

4) The forecaster suffers loss $\ell_{I_t,t}$.

Fig. 1. Prediction with expert advice.

hope to minimize the above cumulative loss. Instead, a meaningful goal is to minimize the
performance gap between our algorithm and the strategy that selects the best action chosen
in hindsight. This performance gap is called the regret and is defined formally as

$$R_n = \max_{i \in \{1, 2, \dots, N\}} \sum_{t=1}^n \left(\ell_{I_t,t} - \ell_{i,t}\right) = \hat{L}_n - L_n^*,$$

where we have also introduced the notation $L_n^* = \min_{i \in \{1, 2, \dots, N\}} \sum_{t=1}^n \ell_{i,t}$. Minimizing the
regret defined above is a well-studied problem. It is known that no matter what algorithm the
forecaster uses,

$$\liminf_{n, N \to \infty} \sup \frac{\mathbb{E} R_n}{\sqrt{(n/2) \ln N}} \ge 1,$$

where the supremum is taken with respect to all possible loss assignments with losses in $[0, 1]$
(see, e.g., [1]). On the other hand, several prediction algorithms are known whose expected
regret is of optimal order $O(\sqrt{n \log N})$ and many of them achieve a regret of this order
with high probability. Perhaps the most popular one is the exponentially weighted average
forecaster (a variant of the weighted majority algorithm of Littlestone and Warmuth [2] and the
aggregating strategies of Vovk [3], also known as Hedge by Freund and Schapire [4]). The
exponentially weighted average forecaster assigns probabilities to the actions that are inversely
proportional to an exponential function of the loss accumulated by each action up to time $t$.
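As a minimal sketch of the exponentially weighted average forecaster described above, the following Python function computes the probabilities from the cumulative losses. The function name, the example losses and the tuning $\eta = \sqrt{8 \ln N / n}$ are assumptions made for illustration; they are not taken from the text.

```python
import math

def exp_weights_distribution(cum_losses, eta):
    """Return p with p[i] proportional to exp(-eta * cum_losses[i])."""
    # Subtracting the smallest loss before exponentiating avoids underflow;
    # the normalised probabilities are unchanged by this shift.
    base = min(cum_losses)
    weights = [math.exp(-eta * (loss - base)) for loss in cum_losses]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example: three actions after some rounds of a 100-round game.
n, N = 100, 3
eta = math.sqrt(8 * math.log(N) / n)  # assumed tuning, not from the text
p = exp_weights_distribution([3.0, 5.0, 4.0], eta)
```

Actions with smaller accumulated loss receive larger probability, and the forecaster draws $I_t$ from `p` afresh in every round.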

Another popular forecaster is the follow the perturbed leader (FPL) algorithm of Hannan [5].
Kalai and Vempala [6] showed that Hannan's forecaster, when appropriately modified, indeed
achieves an expected regret of optimal order. At time $t$, the FPL forecaster adds a random
perturbation $Z_{i,t}$ to the cumulative loss $L_{i,t-1} = \sum_{s=1}^{t-1} \ell_{i,s}$ of each action and chooses an action
that minimizes the sum $L_{i,t-1} + Z_{i,t}$. If the vector of random variables $Z_t = (Z_{1,t}, \dots, Z_{N,t})$
has joint density $(\eta/2)^N e^{-\eta \|z\|_1}$ for $\eta \sim \sqrt{\log N / n}$, then the expected regret of the forecaster
is of order $O(\sqrt{n \log N})$ (see, e.g., [6]). This is true whether $Z_1, \dots, Z_n$ are
independent or not. If they are independent, then one may show that the regret is concentrated
around its expectation. Another interesting choice is when $Z_1 = \cdots = Z_n$, that is, the same
perturbation is used over time. Even though this forecaster has an expected regret of optimal
order, the regret is much less concentrated and may fail with reasonably high probability.



Small regret is not the only desirable feature of an online forecasting algorithm. In many
applications, one would like to define forecasters that do not change their prediction too
often. Examples of such problems include the online buffering problem described by Geulen,
Voecking and Winkler [10] and the online lossy source coding problem of Gyorgy and
Neu [11]. A more abstract problem where the number of abrupt switches in the behavior
is costly is the problem of online learning in Markovian decision processes, as described by
Even-Dar, Kakade and Mansour [12] and Neu, Gyorgy, Szepesvari and Antos [13].

To be precise, define the number of action switches up to time $n$ by

$$C_n = \left| \{ 1 < t \le n : I_{t-1} \ne I_t \} \right|.$$

In particular, we are interested in defining randomized forecasters that achieve a regret $R_n$ of
the order $O(\sqrt{n \log N})$ while keeping the number of action switches $C_n$ as small as possible.
However, the usual forecasters with small regret, such as the exponentially weighted average
forecaster or the FPL forecaster with i.i.d. perturbations, may switch actions a large number
of times, typically $\Theta(n)$. Therefore, the design of special forecasters with small regret and
small number of action switches is called for.
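The switch count $C_n$ defined above is straightforward to compute from a recorded action sequence. A minimal sketch follows; the function name `count_switches` is an illustrative choice.

```python
def count_switches(actions):
    """Number of indices t > 1 with I_{t-1} != I_t in the sequence I_1, ..., I_n."""
    return sum(1 for prev, cur in zip(actions, actions[1:]) if prev != cur)

# Hypothetical action sequence I_1, ..., I_5.
example = count_switches([1, 1, 2, 2, 1])
```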

The first paper to explicitly attack this problem is by Geulen, Voecking and Winkler [10],
who propose a variant of the exponentially weighted average forecaster called the "Shrinking
Dartboard" algorithm and prove that it provides an expected regret of $O(\sqrt{n \log N})$, while
guaranteeing that the expected number of switches is at most $O(\sqrt{n \log N})$. A less conscious
attempt to solve the problem is due to Kalai and Vempala [6]; they show that the simplified
version of the FPL algorithm with identical perturbations (as described above) guarantees an
$O(\sqrt{n \log N})$ bound on both the expected regret and the expected number of switches. In this
paper, we propose a method based on FPL in which perturbations are defined by independent
symmetric random walks. We show that this intuitively appealing forecaster has similar regret
and switch-number guarantees as Shrinking Dartboard and FPL with identical perturbations.
A further important advantage of the new forecaster is that it may be used simply in the
more general problem of online combinatorial (or, more generally, linear) optimization. We
postpone the definitions and the statement of the results to Section IV below.

II. The algorithm 

To address the problem described in the previous section, we propose a variant of the
Follow the Perturbed Leader (FPL) algorithm. The proposed forecaster perturbs the loss of
each action at every time instant by a symmetric coin flip and chooses an action with minimal
cumulative perturbed loss. More precisely, the algorithm draws independent random variables
$X_{i,t}$ that take values $\pm 1/2$ with equal probabilities, and $X_{i,t}$ is added to each loss $\ell_{i,t-1}$. At
time $t$, an action $i$ is chosen that minimizes $\sum_{s=1}^{t} (\ell_{i,s-1} + X_{i,s})$ (where we define $\ell_{i,0} = 0$).

Equivalently, the forecaster may be thought of as an FPL algorithm in which the cumulative
losses $L_{i,t-1}$ are perturbed by $Z_{i,t} = \sum_{s=1}^{t} X_{i,s}$. Since for each fixed $i$, $Z_{i,1}, Z_{i,2}, \dots$ is a
symmetric random walk, cumulative losses of the $N$ actions are perturbed by $N$ independent
symmetric random walks. This is the way the algorithm is presented in Algorithm 1.
Algorithm 1 Prediction by random-walk perturbation.
Initialization: set $L_{i,0} = 0$ and $Z_{i,0} = 0$ for all $i = 1, 2, \dots, N$.
For all $t = 1, 2, \dots, n$, repeat

1) Draw $X_{i,t}$ for all $i = 1, 2, \dots, N$ such that

$$X_{i,t} = \begin{cases} \tfrac{1}{2} & \text{with probability } \tfrac{1}{2}, \\ -\tfrac{1}{2} & \text{with probability } \tfrac{1}{2}. \end{cases}$$

2) Let $Z_{i,t} = Z_{i,t-1} + X_{i,t}$ for all $i = 1, 2, \dots, N$.

3) Choose action

$$I_t = \arg\min_{i} \left( L_{i,t-1} + Z_{i,t} \right).$$

4) Observe losses $\ell_{i,t}$ for all $i = 1, 2, \dots, N$, suffer loss $\ell_{I_t,t}$.

5) Set $L_{i,t} = L_{i,t-1} + \ell_{i,t}$ for all $i = 1, 2, \dots, N$.
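The five steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative implementation under stated assumptions: losses are passed as a list of per-round lists, ties in step 3 go to the lowest index (the algorithm allows any tie-breaking), and the names `random_walk_fpl` and `regret` are hypothetical.

```python
import random

def random_walk_fpl(losses, rng=None):
    """Run Algorithm 1 on losses[t][i] in [0, 1]; return the chosen actions."""
    if rng is None:
        rng = random.Random()
    N = len(losses[0])
    L = [0.0] * N  # cumulative losses L_{i,t-1}
    Z = [0.0] * N  # random-walk perturbations Z_{i,t}
    actions = []
    for loss_t in losses:
        # Steps 1-2: advance every walk by +1/2 or -1/2 with equal probability.
        for i in range(N):
            Z[i] += rng.choice((0.5, -0.5))
        # Step 3: follow the perturbed leader (lowest index wins ties).
        I_t = min(range(N), key=lambda i: L[i] + Z[i])
        actions.append(I_t)
        # Steps 4-5: observe the losses and update the cumulative losses.
        for i in range(N):
            L[i] += loss_t[i]
    return actions

def regret(losses, actions):
    """Forecaster's cumulative loss minus that of the best fixed action."""
    incurred = sum(loss_t[a] for loss_t, a in zip(losses, actions))
    best = min(sum(column) for column in zip(*losses))
    return incurred - best

# Hypothetical loss sequence: action 0 is always best.
rng = random.Random(1)
n = 200
losses = [[0.0, 1.0, 0.5] for _ in range(n)]
actions = random_walk_fpl(losses, rng)
```

Only the current cumulative losses and walk positions are stored, so each round costs $O(N)$ time and memory.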



A simple variation is when one replaces random coin flips by independent standard normal 
random variables. Both have similar performance guarantees and we choose ±(l/2)-valued 
perturbations for mathematical convenience. In Section [IV] we switch to normally distributed 
perturbations — again driven by mathematical simplicity. In practice both versions are expected 
to have a similar behavior. 

Conceptually, the difference between standard FPL and the proposed version is the way
the perturbations are generated: while common versions of FPL use perturbations that are
generated in an i.i.d. fashion, the perturbations of the algorithm proposed here are dependent.
This will enable us to control the number of action switches during the learning process. Note
that the standard deviation of these perturbations is still of order $\sqrt{t}$, just like for the standard
FPL forecaster with optimal parameter settings.

To obtain intuition why this approach will solve our problem, first consider a problem with
$N = 2$ actions and an environment that generates equal losses, say $\ell_{i,t} = 0$ for all $i$ and $t$.
When using i.i.d. perturbations, FPL switches actions with probability 1/2 in each
round, thus yielding $C_t = t/2 + O(\sqrt{t})$ with overwhelming probability. The same holds for
the exponentially weighted average forecaster. On the other hand, when using the random-walk
perturbations described above, we only switch between the actions when the leading
random walk is changed, that is, when the difference of the two random walks (which is
also a symmetric random walk) hits zero. It is well known that the number of occurrences
of this event up to time $t$ is $O_p(\sqrt{t})$, see [14]. As we show below, this is the worst case for
the number of switches.
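The two-action example above can be explored with a small simulation. The sketch below is illustrative only: it assumes all losses are zero, uses standard Gaussian noise as a stand-in for the i.i.d. perturbations, and breaks ties in favor of action 0. The function names, the seed and the horizon are arbitrary choices.

```python
import random

def switches_iid(n, rng):
    """Lead changes of FPL with fresh noise each round (two actions, zero losses)."""
    switches, prev = 0, None
    for _ in range(n):
        # All cumulative losses are equal, so the fresh noise alone picks the leader.
        leader = 0 if rng.gauss(0.0, 1.0) <= rng.gauss(0.0, 1.0) else 1
        if prev is not None and leader != prev:
            switches += 1
        prev = leader
    return switches

def switches_random_walk(n, rng):
    """Lead changes when each action is perturbed by a +/- 1/2 random walk."""
    z = [0.0, 0.0]
    switches, prev = 0, None
    for _ in range(n):
        z[0] += rng.choice((0.5, -0.5))
        z[1] += rng.choice((0.5, -0.5))
        leader = 0 if z[0] <= z[1] else 1  # ties go to action 0
        if prev is not None and leader != prev:
            switches += 1
        prev = leader
    return switches

rng = random.Random(0)
n = 10_000
iid_count = switches_iid(n, rng)
walk_count = switches_random_walk(n, rng)
```

With i.i.d. noise the count concentrates around $n/2$, whereas with random-walk perturbations a switch requires the difference of the two walks to change sign, which happens on the order of $\sqrt{n}$ times.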

III. Performance bounds 

The next theorem summarizes our performance bounds for the proposed forecaster. 

Theorem 1: The expected regret and expected number of switches of actions of the forecaster
of Algorithm 1 satisfy, for all possible loss sequences (under the oblivious-adversary
model),

$$\mathbb{E} R_n \le 2\, \mathbb{E} C_n \le 8 \sqrt{2 n \log N} + 16 \log n + 16.$$
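To get a feel for the magnitude of the right-hand side of Theorem 1, one can evaluate it numerically. The helper below is an illustrative sketch (the name `theorem1_bound` is hypothetical) that uses natural logarithms.

```python
import math

def theorem1_bound(n, N):
    """Evaluate 8*sqrt(2 n log N) + 16 log n + 16 from Theorem 1."""
    return 8 * math.sqrt(2 * n * math.log(N)) + 16 * math.log(n) + 16

# The bound is sublinear in n: it is vacuous for short horizons
# (the regret never exceeds n) and becomes informative for long ones.
short_horizon = theorem1_bound(100, 10)
long_horizon = theorem1_bound(10**6, 10)
```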
Remark. Even though we only prove bounds for the expected regret and the expected number
of switches, it is of great interest to understand upper tail probabilities. However, this is a
highly nontrivial problem. One may get an intuition by considering the case when $N = 2$ and
all losses are equal to zero. In this case the algorithm switches actions whenever a symmetric
random walk returns to zero. This distribution is well understood and the probability that this
occurs more than $x\sqrt{n}$ times during the first $n$ steps is roughly $2\,\mathbb{P}\{G > 2x\} \le 2 e^{-2x^2}$, where
$G$ is a standard normal random variable (see [14, Section III.4]). Thus, in this case we see
that the number of switches is bounded by $O\left(\sqrt{n \log(1/\delta)}\right)$, with probability at least $1 - \delta$.
However, proving analogous bounds for the general case remains a challenge.

To prove the theorem, we first show that the regret can be bounded in terms of the number 
of action switches. Then we turn to analyzing the expected number of action switches. 



A. Regret and number of switches 

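The next simple lemma shows that the regret of the forecaster may be bounded in terms
of the number of times the forecaster switches actions.
Lemma 1: Fix any $i \in \{1, 2, \dots, N\}$. Then

$$\hat{L}_n - L_{i,n} \le 2 C_{n+1} + Z_{i,n+1} - \sum_{t=1}^{n+1} X_{I_{t-1},t}.$$

Proof: We apply Lemma 3.1 of [1] (sometimes referred to as the "be-the-leader" lemma)
for the sequence $(\ell_{\cdot,t-1} + X_{\cdot,t})_{t=1}^{n+1}$ with $\ell_{j,0} = 0$ for all $j \in \{1, 2, \dots, N\}$, obtaining

$$\sum_{t=1}^{n+1} \left( \ell_{I_t,t-1} + X_{I_t,t} \right) \le \sum_{t=1}^{n+1} \left( \ell_{i,t-1} + X_{i,t} \right).$$

Reordering terms, we get

$$\sum_{t=1}^{n} \ell_{I_t,t} \le L_{i,n} + \sum_{t=1}^{n+1} \left( \ell_{I_{t-1},t-1} - \ell_{I_t,t-1} \right) + Z_{i,n+1} - \sum_{t=1}^{n+1} X_{I_t,t}. \qquad (1)$$

The last term can be rewritten as

$$-\sum_{t=1}^{n+1} X_{I_t,t} = -\sum_{t=1}^{n+1} X_{I_{t-1},t} + \sum_{t=1}^{n+1} \left( X_{I_{t-1},t} - X_{I_t,t} \right).$$

Now notice that $X_{I_{t-1},t} - X_{I_t,t}$ and $\ell_{I_{t-1},t-1} - \ell_{I_t,t-1}$ are both zero when $I_t = I_{t-1}$ and are
upper bounded by 1 otherwise. That is, we get that

$$\sum_{t=1}^{n+1} \left( \ell_{I_{t-1},t-1} - \ell_{I_t,t-1} \right) + \sum_{t=1}^{n+1} \left( X_{I_{t-1},t} - X_{I_t,t} \right) \le 2 \sum_{t=1}^{n+1} \mathbb{I}\{ I_{t-1} \ne I_t \} = 2 C_{n+1}.$$

Putting everything together gives the statement of the lemma. ■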
B. Bounding the number of switches

Next we analyze the number of switches $C_n$. In particular, we upper bound the marginal
probability $\mathbb{P}[I_{t+1} \ne I_t]$ for each $t \ge 1$. We define the lead pack $A_t$ as the set of actions that,
at time $t$, have a positive probability of taking the lead at time $t + 1$:

$$A_t = \left\{ i \in \{1, 2, \dots, N\} : L_{i,t-1} + Z_{i,t} < \min_{j} \left( L_{j,t-1} + Z_{j,t} \right) + 2 \right\}.$$

We bound the probability of lead change as

$$\mathbb{P}[I_t \ne I_{t+1}] \le \mathbb{P}[|A_t| > 1].$$

The key to the proof of the theorem is the following lemma that gives an upper bound for
the probability that the lead pack contains more than one action. It implies, in particular, that

$$\mathbb{E}[C_n] \le 4 \sqrt{2 n \log N} + 4 \log n + 4,$$

which is what we need to prove the expected-value bounds of Theorem 1.
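Lemma 2: For each $t = 1, 2, \dots, n$,

$$\mathbb{P}[|A_t| > 1] \le 4 \sqrt{\frac{2 \log N}{t}} + \frac{8}{t}.$$

Proof: Define $p_t(k) = \mathbb{P}[Z_{1,t} = k/2]$ for all $k = -t, \dots, t$ and let $S_t$ denote the set of
leaders at time $t$ (so that the forecaster picks $I_t \in S_t$ arbitrarily):

$$S_t = \left\{ j \in \{1, 2, \dots, N\} : L_{j,t-1} + Z_{j,t} = \min_{i} \left( L_{i,t-1} + Z_{i,t} \right) \right\}.$$

Let us start with analyzing $\mathbb{P}[|A_t| = 1]$:

$$\begin{aligned}
\mathbb{P}[|A_t| = 1]
&= \sum_{k=-t}^{t} \sum_{j=1}^{N} p_t(k)\, \mathbb{P}\left[ \min_{i \in \{1,\dots,N\} \setminus j} \left( L_{i,t-1} + Z_{i,t} \right) \ge L_{j,t-1} + \frac{k}{2} + 2 \right] \\
&\ge \sum_{k=-t}^{t-4} \sum_{j=1}^{N} p_t(k+4)\, \mathbb{P}\left[ \min_{i \in \{1,\dots,N\} \setminus j} \left( L_{i,t-1} + Z_{i,t} \right) \ge L_{j,t-1} + \frac{k+4}{2} \right] \frac{p_t(k)}{p_t(k+4)} \\
&= \sum_{k=-t+4}^{t} \sum_{j=1}^{N} p_t(k)\, \mathbb{P}\left[ \min_{i \in \{1,\dots,N\} \setminus j} \left( L_{i,t-1} + Z_{i,t} \right) \ge L_{j,t-1} + \frac{k}{2} \right] \frac{p_t(k-4)}{p_t(k)}.
\end{aligned}$$

Before proceeding, we need to make two observations. First of all,

$$\begin{aligned}
\sum_{j=1}^{N} p_t(k)\, \mathbb{P}\left[ \min_{i \in \{1,\dots,N\} \setminus j} \left( L_{i,t-1} + Z_{i,t} \right) \ge L_{j,t-1} + \frac{k}{2} \right]
&\ge \mathbb{P}\left[ \exists j \in S_t : Z_{j,t} = \frac{k}{2} \right] \\
&\ge \mathbb{P}\left[ \min_{j \in S_t} Z_{j,t} = \frac{k}{2} \right],
\end{aligned}$$

where the first inequality follows from the union bound and the second from the fact that
the latter event implies the former. Also notice that $Z_{i,t} + \frac{t}{2}$ is binomially distributed with
parameters $t$ and $1/2$ and therefore $p_t(k) = \binom{t}{(t+k)/2} 2^{-t}$. Hence

$$\frac{p_t(k-4)}{p_t(k)} = \frac{\left(\frac{t+k}{2}\right)! \left(\frac{t-k}{2}\right)!}{\left(\frac{t+k}{2} - 2\right)! \left(\frac{t-k}{2} + 2\right)!} = 1 + \frac{4(t+1)(k-2)}{(t-k+2)(t-k+4)}.$$

It can be easily verified that

$$\frac{4(t+1)(k-2)}{(t-k+2)(t-k+4)} \ge \frac{4(t+1)(k-2)}{(t+2)(t+4)}$$

holds for all $k \in [-t, t]$. Using our first observation, we get

$$\begin{aligned}
\mathbb{P}[|A_t| = 1]
&\ge \sum_{k=-t+4}^{t} \sum_{j=1}^{N} p_t(k)\, \mathbb{P}\left[ \min_{i \in \{1,\dots,N\} \setminus j} \left( L_{i,t-1} + Z_{i,t} \right) \ge L_{j,t-1} + \frac{k}{2} \right] \frac{p_t(k-4)}{p_t(k)} \\
&\ge \sum_{k=-t+4}^{t} \mathbb{P}\left[ \min_{j \in S_t} Z_{j,t} = \frac{k}{2} \right] \frac{p_t(k-4)}{p_t(k)}.
\end{aligned}$$

Along with our second observation, this implies

$$\begin{aligned}
\mathbb{P}[|A_t| > 1]
&\le 1 - \sum_{k=-t+4}^{t} \mathbb{P}\left[ \min_{j \in S_t} Z_{j,t} = \frac{k}{2} \right] \frac{p_t(k-4)}{p_t(k)} \\
&\le 1 - \sum_{k=-t+4}^{t} \mathbb{P}\left[ \min_{j \in S_t} Z_{j,t} = \frac{k}{2} \right] \left( 1 + \frac{4(t+1)(k-2)}{(t+2)(t+4)} \right) \\
&\le \sum_{k=-t}^{t} \mathbb{P}\left[ \min_{j \in S_t} Z_{j,t} = \frac{k}{2} \right] \frac{4(2-k)(t+1)}{(t+2)(t+4)} \\
&= \frac{8(t+1)}{(t+2)(t+4)} - \frac{8(t+1)}{(t+2)(t+4)}\, \mathbb{E}\left[ \min_{j \in S_t} Z_{j,t} \right] \\
&\le \frac{8}{t} + \frac{8}{t}\, \mathbb{E}\left[ \max_{i \in \{1, 2, \dots, N\}} Z_{i,t} \right].
\end{aligned}$$

Now using $\mathbb{E}\left[ \max_i Z_{i,t} \right] \le \sqrt{\frac{t \log N}{2}}$ implies

$$\mathbb{P}[|A_t| > 1] \le 4 \sqrt{\frac{2 \log N}{t}} + \frac{8}{t},$$

as desired.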
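The proof relies on a closed form for the ratio $p_t(k-4)/p_t(k)$, namely $1 + \frac{4(t+1)(k-2)}{(t-k+2)(t-k+4)}$, where $p_t(k) = \mathbb{P}[Z_{1,t} = k/2]$ and $Z_{1,t} + t/2$ is binomial with parameters $t$ and $1/2$. The following sketch checks that identity in exact rational arithmetic for small $t$. The helper names are illustrative, and only values of $k$ with the same parity as $t$ are tested, since $p_t(k) = 0$ otherwise.

```python
from fractions import Fraction
from math import comb

def p(t, k):
    """P[Z_{1,t} = k/2], using that Z_{1,t} + t/2 is Binomial(t, 1/2)."""
    if (t + k) % 2 or abs(k) > t:
        return Fraction(0)
    return Fraction(comb(t, (t + k) // 2), 2 ** t)

def ratio_closed_form(t, k):
    """Closed form 1 + 4(t+1)(k-2) / ((t-k+2)(t-k+4)) used in the proof."""
    return 1 + Fraction(4 * (t + 1) * (k - 2), (t - k + 2) * (t - k + 4))

checked = 0
for t in range(1, 40):
    for k in range(-t + 4, t + 1):
        if (t + k) % 2 == 0:  # skip k of the wrong parity, where p_t(k) = 0
            assert p(t, k - 4) / p(t, k) == ratio_closed_form(t, k)
            checked += 1
```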
IV. Online combinatorial optimization

In this section we study the case of online linear optimization (see, among others, [15],
[16], [17], [23]). This is a similar prediction problem
as the one described in the introduction but here each action $i$ is represented by a vector
$v_i \in \mathbb{R}^d$. The loss corresponding to action $i$ at time $t$ equals $v_i^\top \ell_t$ where $\ell_t \in [0, 1]^d$ is the
so-called loss vector. Thus, given a set of actions $S = \{v_i : i = 1, 2, \dots, N\} \subseteq \mathbb{R}^d$, at every
time instant $t$, the forecaster chooses, in a possibly randomized way, a vector $V_t \in S$ and
suffers loss $V_t^\top \ell_t$. We denote by $\hat{L}_n = \sum_{t=1}^n V_t^\top \ell_t$ the cumulative loss of the forecaster and
the regret becomes

$$\hat{L}_n - \min_{v \in S} v^\top L_n,$$

where $L_t = \sum_{s=1}^t \ell_s$ is the cumulative loss vector. While the results of the previous section
still hold when treating each $v_i \in S$ as a separate action, one may gain important
computational advantage by taking the structure of the action set into account. In particular,
FPL-type forecasters may often be computed efficiently. In this section we
propose such a forecaster which adds independent random-walk perturbations to the individual
components of the loss vector. To gain simplicity in the presentation, we restrict our attention
to the case of online combinatorial optimization in which $S \subseteq \{0, 1\}^d$, that is, each action is
represented by a binary vector. This special case arguably contains most important applications
such as the online shortest path problem. In this example, a fixed directed acyclic graph of $d$
edges is given with two distinguished vertices $u$ and $w$. The forecaster, at every time instant $t$,
chooses a directed path from $u$ to $w$. Such a path is represented by its binary incidence vector
$v \in \{0, 1\}^d$. The components of the loss vector $\ell_t \in [0, 1]^d$ represent losses assigned to the $d$
edges and $v^\top \ell_t$ is the total loss assigned to the path $v$. Another (non-essential) simplifying
assumption is that every action $v \in S$ has the same number of 1's: $\|v\|_1 = m$ for all $v \in S$.
The value of $m$ plays an important role in the bounds below.

The proposed prediction algorithm is defined as follows. Let $X_1, \dots, X_n$ be independent
Gaussian random vectors taking values in $\mathbb{R}^d$ such that the components of each $X_t$ are i.i.d.
normal $X_{i,t} \sim \mathcal{N}(0, \eta^2)$ for some fixed $\eta > 0$ whose value will be specified later. Denote

$$Z_t = \sum_{s=1}^{t} X_s.$$

The forecaster at time $t$ chooses the action

$$V_t = \arg\min_{v \in S} v^\top \left( L_{t-1} + Z_t \right),$$

where $L_t = \sum_{s=1}^{t} \ell_s$ for $t \ge 1$ and $L_0 = (0, \dots, 0)^\top$.
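A minimal sketch of the forecaster just defined is given below. It is illustrative only: the decision set is enumerated explicitly and the argmin is found by brute force, which stands in for an efficient optimization oracle (for example, a shortest-path routine). The function name, the toy decision set of all binary vectors with exactly $m$ ones, and the parameter values are assumptions.

```python
import random
from itertools import combinations

def gaussian_walk_fpl(loss_vectors, actions, eta, rng=None):
    """Choose V_t = argmin_v v^T (L_{t-1} + Z_t), with Z_t a Gaussian random walk."""
    if rng is None:
        rng = random.Random()
    d = len(loss_vectors[0])
    L = [0.0] * d  # cumulative loss vector L_{t-1}
    Z = [0.0] * d  # perturbation Z_t = X_1 + ... + X_t
    chosen = []
    for loss_t in loss_vectors:
        for j in range(d):
            Z[j] += rng.gauss(0.0, eta)  # X_{j,t} ~ N(0, eta^2)
        scores = [L[j] + Z[j] for j in range(d)]
        # Brute-force argmin over the decision set (stand-in for an oracle).
        V_t = min(actions, key=lambda v: sum(v[j] * scores[j] for j in range(d)))
        chosen.append(V_t)
        for j in range(d):
            L[j] += loss_t[j]
    return chosen

# Hypothetical toy decision set: all vectors in {0,1}^d with exactly m ones.
d, m = 5, 2
S = [tuple(1 if j in idx else 0 for j in range(d))
     for idx in combinations(range(d), m)]
rng = random.Random(0)
loss_vectors = [[rng.random() for _ in range(d)] for _ in range(50)]
chosen = gaussian_walk_fpl(loss_vectors, S, eta=1.0, rng=rng)
```

Because the perturbation enters only through the vector `scores`, any routine that minimizes a linear function over $S$ can replace the brute-force search.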

The next theorem bounds the performance of the proposed forecaster. Again, we are not
only interested in the regret but also the number of switches $\sum_{t=1}^{n} \mathbb{I}\{V_{t+1} \ne V_t\}$. The regret is
of similar order, roughly $m\sqrt{dn}$, as that of the standard FPL forecaster, up to a logarithmic
factor. Moreover, the expected number of switches is $O\left(m^2 (\log d)^{5/2} \sqrt{n}\right)$. Remarkably, the
dependence on $d$ is only polylogarithmic and it is the weight $m$ of the actions that plays an
important role.
We note in passing that the Shrinking Dartboard algorithm of [10] can be used for
simultaneously guaranteeing that the expected regret is $O(m^{3/2} \sqrt{n \log d})$ and the expected
number of switches is of order $\sqrt{mn \log d}$. However, as this algorithm requires explicit computation
of the exponentially weighted forecaster, it can only be efficiently implemented for some special
decision sets $S$; see [22] and [23] for some examples. On the other hand, our algorithm can
be efficiently implemented whenever there exists an efficient implementation of the static
optimization problem of finding $\arg\min_{v \in S} v^\top \ell$ for any $\ell \in \mathbb{R}^d$.

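Theorem 2: Fix any $v \in S$. The expected regret and the expected number of action switches
satisfy (under the oblivious adversary model)

$$\mathbb{E}\left[\hat{L}_n\right] - v^\top L_n \le m \sqrt{n} \left( \frac{2d}{\eta} + \eta \sqrt{2 \log d} \right) + \frac{m d (\log n + 1)}{\eta^2}$$

and

$$\begin{aligned}
\mathbb{E} \sum_{t=1}^{n} \mathbb{I}\{V_{t+1} \ne V_t\}
\le{} & \sum_{t=1}^{n} \frac{m \left( 1 + 2\eta \left( 2 \log d + \sqrt{2 \log d} + 1 \right) + \eta^2 \left( 2 \log d + \sqrt{2 \log d} + 1 \right) \right)}{2 \eta^2 t} \\
& + \sum_{t=1}^{n} \frac{m \left( 1 + \eta \left( 2 \log d + \sqrt{2 \log d} + 1 \right) \right) \sqrt{2 \log d}}{\eta \sqrt{t}}.
\end{aligned}$$

In particular, setting $\eta = \sqrt{2d / \sqrt{2 \log d}}$, which balances the two terms in the parentheses of the first bound, yields

$$\mathbb{E}\left[\hat{L}_n\right] - v^\top L_n \le 4 m \sqrt{dn} \sqrt[4]{\log d} + m (\log n + 1) \sqrt{\log d}$$

and

$$\mathbb{E} \sum_{t=1}^{n} \mathbb{I}\{V_{t+1} \ne V_t\} = O\left( m (\log d)^{5/2} \sqrt{n} \right).$$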
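The effect of the tuning in Theorem 2 can be checked numerically. The sketch below assumes that the general regret bound reads $m\sqrt{n}\,(2d/\eta + \eta\sqrt{2\log d}) + md(\log n + 1)/\eta^2$ and that $\eta$ is chosen to balance the two terms in the parentheses, that is, $\eta^2 = 2d/\sqrt{2\log d}$. Both are readings of the theorem rather than statements confirmed independently, and the function names are hypothetical.

```python
import math

def regret_bound(n, d, m, eta):
    """Assumed general bound: m sqrt(n) (2d/eta + eta sqrt(2 log d)) + m d (log n + 1)/eta^2."""
    return (m * math.sqrt(n) * (2 * d / eta + eta * math.sqrt(2 * math.log(d)))
            + m * d * (math.log(n) + 1) / eta ** 2)

def tuned_bound(n, d, m):
    """Simplified bound: 4 m sqrt(dn) (log d)^(1/4) + m (log n + 1) sqrt(log d)."""
    return (4 * m * math.sqrt(d * n) * math.log(d) ** 0.25
            + m * (math.log(n) + 1) * math.sqrt(math.log(d)))

def balanced_eta(d):
    """Value of eta equalising 2d/eta and eta*sqrt(2 log d); requires d >= 2."""
    return math.sqrt(2 * d / math.sqrt(2 * math.log(d)))

# Hypothetical problem sizes (n, d, m).
cases = [(100, 2, 1), (10**4, 50, 5), (10**6, 1000, 10)]
gaps = [tuned_bound(n, d, m) - regret_bound(n, d, m, balanced_eta(d))
        for n, d, m in cases]
```

Under these assumptions the general bound evaluated at the balanced $\eta$ never exceeds the simplified expression, so every entry of `gaps` is nonnegative.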

The proof of the regret bound is quite standard, similar to the proof of Theorem 3 in [25],
and is deferred to the appendix. The more interesting part is the bound for the expected
number of action switches $\mathbb{E} \sum_{t=1}^{n} \mathbb{I}\{V_{t+1} \ne V_t\} = \sum_{t=1}^{n} \mathbb{P}[V_{t+1} \ne V_t]$. It follows from the
lemma below and the well-known fact that the expected value of the maximum of the square
of $d$ independent standard normal random variables is at most $2 \log d + \sqrt{2 \log d} + 1$ (see,
e.g., [26]). Thus, it suffices to prove the following:

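Lemma 3: For each $t = 1, 2, \dots, n$,

$$\mathbb{P}\left[ V_{t+1} \ne V_t \mid X_{t+1} \right] \le \frac{m \|\ell_t + X_{t+1}\|_\infty^2}{2 \eta^2 t} + \frac{m \|\ell_t + X_{t+1}\|_\infty \sqrt{2 \log d}}{\eta \sqrt{t}}.$$

Proof: We use the notation $\mathbb{P}_t[\cdot] = \mathbb{P}[\cdot \mid X_{t+1}]$ and $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid X_{t+1}]$. Also, let

$$h_t = \ell_t + X_{t+1} \quad \text{and} \quad H_t = \sum_{s=0}^{t-1} h_s.$$

Furthermore, we will use the shorthand notation $c = \|h_t\|_\infty$. Define the set $A_t$ as the lead
pack:

$$A_t = \left\{ w \in S : (w - V_t)^\top H_t \le \|w - V_t\|_1 c \right\}.$$

Observe that the choice of $c$ guarantees that no action outside $A_t$ can take the lead at time
$t + 1$, since if $w \notin A_t$, then

$$(w - V_t)^\top H_t > \left| (w - V_t)^\top h_t \right|,$$

so $(w - V_t)^\top H_{t+1} > 0$ and $w$ cannot be the new leader. It follows that we can upper bound
the probability of switching as

$$\mathbb{P}_t[V_{t+1} \ne V_t] \le \mathbb{P}_t[|A_t| > 1],$$

which leaves us with the problem of upper bounding $\mathbb{P}_t[|A_t| > 1]$. Similarly to the proof of
Lemma 2, we start analyzing $\mathbb{P}_t[|A_t| = 1]$:

$$\begin{aligned}
\mathbb{P}_t[|A_t| = 1]
&= \sum_{v \in S} \mathbb{P}_t\left[ \forall w \ne v : (w - v)^\top H_t > \|w - v\|_1 c \right] \\
&= \sum_{v \in S} \int_{y \in \mathbb{R}} f_v(y)\, \mathbb{P}_t\left[ \forall w \ne v : w^\top H_t > y + \|w - v\|_1 c \mid v^\top H_t = y \right] dy, \qquad (2)
\end{aligned}$$

where $f_v$ is the density of $v^\top H_t$. Next we crucially use the fact that the conditional
distributions of correlated Gaussian random variables are also Gaussian. In particular, defining
$k(w, v) = (m - \|w - v\|_1)$, the covariances are given as

$$\mathrm{cov}\left( w^\top H_t, v^\top H_t \right) = \eta^2 (m - \|w - v\|_1) t = \eta^2 k(w, v) t.$$

Let us organize all actions $w \in S \setminus v$ into a matrix $W = (w_1, w_2, \dots, w_{N-1})$. The conditional
distribution of $W^\top H_t$ is an $(N-1)$-variate Gaussian distribution with mean

$$\mu_v(y) = \left( w_1^\top L_{t-1} + y \frac{k(w_1, v)}{m},\; w_2^\top L_{t-1} + y \frac{k(w_2, v)}{m},\; \dots,\; w_{N-1}^\top L_{t-1} + y \frac{k(w_{N-1}, v)}{m} \right)^\top$$

and covariance matrix $\Sigma_v$, given that $v^\top H_t = y$. Defining $K = (k(w_1, v), \dots, k(w_{N-1}, v))^\top$
and using the notation $\varphi(x) = \frac{1}{\sqrt{(2\pi)^{N-1} |\Sigma_v|}} \exp\left(-\frac{x^2}{2}\right)$, we get that

$$\begin{aligned}
& \mathbb{P}_t\left[ \forall w \ne v : w^\top H_t > y + \|w - v\|_1 c \mid v^\top H_t = y \right] \\
&= \int \cdots \int_{z_i = y + (m - k(w_i, v)) c}^{\infty} \varphi\left( \sqrt{(z - \mu_v(y))^\top \Sigma_v^{-1} (z - \mu_v(y))} \right) dz \\
&= \int \cdots \int_{z_i = y + (m - k(w_i, v)) c + k(w_i, v) c}^{\infty} \varphi\left( \sqrt{(z - \mu_v(y) - cK)^\top \Sigma_v^{-1} (z - \mu_v(y) - cK)} \right) dz \\
&= \int \cdots \int_{z_i = y + mc}^{\infty} \varphi\left( \sqrt{(z - \mu_v(y + mc))^\top \Sigma_v^{-1} (z - \mu_v(y + mc))} \right) dz \\
&= \mathbb{P}_t\left[ \forall w \ne v : w^\top H_t > y + mc \mid v^\top H_t = y + mc \right],
\end{aligned}$$

where we used $\mu_v(y + mc) = \mu_v(y) + cK$. Using this, we rewrite (2) as

$$\begin{aligned}
\mathbb{P}_t[|A_t| = 1]
&= \sum_{v \in S} \int_{y \in \mathbb{R}} f_v(y)\, \mathbb{P}_t\left[ \forall w \ne v : w^\top H_t > y \mid v^\top H_t = y \right] dy \\
&\quad - \sum_{v \in S} \int_{y \in \mathbb{R}} \left( f_v(y) - f_v(y - mc) \right) \mathbb{P}_t\left[ \forall w \ne v : w^\top H_t > y \mid v^\top H_t = y \right] dy \\
&= 1 - \sum_{v \in S} \int_{y \in \mathbb{R}} \left( f_v(y) - f_v(y - mc) \right) \mathbb{P}_t\left[ \forall w \ne v : w^\top H_t > y \mid v^\top H_t = y \right] dy.
\end{aligned}$$

To treat the remaining term, we use that $v^\top H_t$ is Gaussian with mean $v^\top L_{t-1}$ and standard
deviation $\eta \sqrt{mt}$ and obtain

$$\begin{aligned}
f_v(y) - f_v(y - mc) &= f_v(y) \left( 1 - \frac{f_v(y - mc)}{f_v(y)} \right) \\
&\le f_v(y) \left( \frac{m c^2}{2 \eta^2 t} - \frac{c (y - v^\top L_{t-1})}{\eta^2 t} \right).
\end{aligned}$$

Thus

$$\begin{aligned}
\mathbb{P}_t[|A_t| > 1]
&\le \sum_{v \in S} \int_{y \in \mathbb{R}} \left( f_v(y) - f_v(y - mc) \right) \mathbb{P}_t\left[ \forall w \ne v : w^\top H_t > y \mid v^\top H_t = y \right] dy \\
&\le \frac{m c^2}{2 \eta^2 t} - \frac{c\, \mathbb{E}\left[ V_t^\top Z_t \right]}{\eta^2 t}
\le \frac{m c^2}{2 \eta^2 t} + \frac{m c\, \mathbb{E}\left[ \|Z_t\|_\infty \right]}{\eta^2 t} \\
&\le \frac{m \|h_t\|_\infty^2}{2 \eta^2 t} + \frac{m \|h_t\|_\infty \sqrt{2 \log d}}{\eta \sqrt{t}},
\end{aligned}$$

where we used the definition of $c$ and $\mathbb{E}\left[ \|Z_t\|_\infty \right] \le \eta \sqrt{2 t \log d}$ in the last step.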
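Appendix

Proof of the first statement of Theorem 2: The proof is based on the proof of Theorem 4.2
of [1] and Theorem 3 of [25]. The main difference from those proofs is that the standard
deviation of our perturbations changes over time; however, this issue is very easy to treat. First,
we define an infeasible "forecaster" that peeks one step into the future and uses the perturbation
$\tilde{Z}_t = \sqrt{t} X_1$:

$$\tilde{V}_t = \arg\min_{w \in S} w^\top \left( L_t + \tilde{Z}_t \right).$$

Using Lemma 3.1 of [1], we get

$$\sum_{t=1}^{n} \tilde{V}_t^\top \left( \ell_t + (\tilde{Z}_t - \tilde{Z}_{t-1}) \right) \le v^\top \left( L_n + \tilde{Z}_n \right).$$

After reordering, we obtain

$$\begin{aligned}
\sum_{t=1}^{n} V_t^\top \ell_t
&\le v^\top L_n + v^\top \tilde{Z}_n + \sum_{t=1}^{n} (V_t - \tilde{V}_t)^\top \ell_t - \sum_{t=1}^{n} \tilde{V}_t^\top (\tilde{Z}_t - \tilde{Z}_{t-1}) \\
&= v^\top L_n + v^\top \tilde{Z}_n + \sum_{t=1}^{n} (V_t - \tilde{V}_t)^\top \ell_t + \sum_{t=1}^{n} \left( \sqrt{t-1} - \sqrt{t} \right) \tilde{V}_t^\top X_1.
\end{aligned}$$

The last term can be bounded as

$$\begin{aligned}
\sum_{t=1}^{n} \left( \sqrt{t-1} - \sqrt{t} \right) \tilde{V}_t^\top X_1
&\le \sum_{t=1}^{n} \left( \sqrt{t} - \sqrt{t-1} \right) \left| \tilde{V}_t^\top X_1 \right| \\
&\le m \sum_{t=1}^{n} \left( \sqrt{t} - \sqrt{t-1} \right) \|X_1\|_\infty \\
&\le m \sqrt{n}\, \|X_1\|_\infty.
\end{aligned}$$

Taking expectations, we obtain the bound

$$\mathbb{E}\left[ \hat{L}_n \right] - v^\top L_n \le \sum_{t=1}^{n} \mathbb{E}\left[ (V_t - \tilde{V}_t)^\top \ell_t \right] + \eta m \sqrt{2 n \log d},$$

where we used $\mathbb{E}\left[ \|X_1\|_\infty \right] \le \eta \sqrt{2 \log d}$. That is, we are left with the problem of bounding
$\mathbb{E}\left[ (V_t - \tilde{V}_t)^\top \ell_t \right]$ for each $t \ge 1$.

To this end, let

$$v(z) = \arg\min_{w \in S} w^\top z$$

for all $z \in \mathbb{R}^d$, and also

$$F_t(z) = v(z)^\top \ell_t.$$

Further, let $f_t(z)$ be the density of $Z_t$, which coincides with the density of $\tilde{Z}_t$. We have

$$\begin{aligned}
\mathbb{E}\left[ V_t^\top \ell_t \right] &= \mathbb{E}\left[ F_t(L_{t-1} + Z_t) \right] \\
&= \int f_t(z) F_t(L_{t-1} + z)\, dz \\
&= \int f_t(z) F_t(L_t - \ell_t + z)\, dz \\
&= \int f_t(z + \ell_t) F_t(L_t + z)\, dz \\
&= \mathbb{E}\left[ F_t(L_t + \tilde{Z}_t) \right] + \int \left( f_t(z + \ell_t) - f_t(z) \right) F_t(L_t + z)\, dz \\
&= \mathbb{E}\left[ \tilde{V}_t^\top \ell_t \right] + \int \left( f_t(z) - f_t(z - \ell_t) \right) F_t(L_{t-1} + z)\, dz.
\end{aligned}$$

The last term can be upper bounded as

$$\begin{aligned}
\int f_t(z) \left( 1 - \exp\left( \frac{(z - \ell_t)^\top \ell_t}{\eta^2 t} \right) \right) F_t(L_{t-1} + z)\, dz
&\le - \int f_t(z) \frac{(z - \ell_t)^\top \ell_t}{\eta^2 t} F_t(L_{t-1} + z)\, dz \\
&\le \frac{\mathbb{E}\left[ V_t^\top \ell_t \right] \|\ell_t\|_2^2}{\eta^2 t} + \frac{m}{\eta^2 t} \int f_t(z) \left| z^\top \ell_t \right| dz \\
&\le \frac{md}{\eta^2 t} + \frac{m}{\eta^2 t} \int f_t(z) \|z\|_1\, dz \\
&= \frac{md}{\eta^2 t} + \sqrt{\frac{2}{\pi}} \frac{md}{\eta \sqrt{t}},
\end{aligned}$$

where we used $\mathbb{E}\left[ \|Z_t\|_1 \right] = \eta d \sqrt{2t/\pi}$ in the last step. Putting everything together, we obtain
the statement of the theorem as

$$\mathbb{E}\left[ \hat{L}_n \right] - v^\top L_n \le \frac{2md}{\eta} \sqrt{\frac{2n}{\pi}} + \eta m \sqrt{2 n \log d} + \frac{md(\log n + 1)}{\eta^2}.$$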